98 research outputs found
Roller: A novel approach to web information extraction
The research regarding web information extraction focuses
on learning rules to extract some selected information from web documents.
Many proposals are ad-hoc and cannot benefit from the advances
in machine learning; furthermore, they are likely to fade away as the Web
evolves and their intrinsic assumptions are not satisfied. Some authors
have explored transforming web documents into relational data and then
using techniques inspired by inductive logic programming.
In theory, such proposals should be easier to adapt as the Web evolves
because they build on catalogues of features that can be adapted without
changing the proposals themselves. Unfortunately, they are difficult
to scale as the number of documents or features increases. In the general
field of machine learning, there are propositio-relational proposals
that attempt to provide effective and efficient means to learn from relational
data using propositional techniques, but they have seldom been
explored regarding web information extraction. In this article, we present
a new proposal called Roller: it relies on a search procedure that uses
a dynamic flattening technique to explore the context of the nodes that
provide the information to be extracted; it is configured with an open
catalogue of features, so that it can adapt to the evolution of the Web; it
also requires a base learner and a rule scorer, which helps it benefit from
the continuous advances in machine learning. Our experiments confirm
that it outperforms other state-of-the-art proposals in terms of effectiveness
and that it is very competitive in terms of efficiency; we have also
confirmed that our conclusions are solid from a statistical point of view.
Funding: Ministerio de Educación y Ciencia TIN2007-64119; Junta de Andalucía P07-TIC-2602; Junta de Andalucía P08-TIC-4100; Ministerio de Ciencia e Innovación TIN2008-04718-E; Ministerio de Ciencia e Innovación TIN2010-21744; Ministerio de Economía, Industria y Competitividad TIN2010-09809-E; Ministerio de Ciencia e Innovación TIN2010-10811-E; Ministerio de Ciencia e Innovación TIN2010-09988-E; Ministerio de Economía y Competitividad TIN2011-15497-E; Ministerio de Economía y Competitividad TIN2013-40848-
On Learning Web Information Extraction Rules with TANGO
The research on Enterprise Systems Integration focuses on proposals to support
business processes by re-using existing systems. Wrappers help re-use web applications that provide a user interface only. They emulate a human user who
interacts with them and extracts the information of interest in a structured format. In this article, we present TANGO, which is our proposal to learn rules
to extract information from semi-structured web documents with high precision
and recall, which is a must in the context of Enterprise Systems Integration. It
relies on an open catalogue of features that helps map the input documents into
a knowledge base in which every DOM node is represented by means of HTML,
DOM, CSS, relational, and user-defined features. Then a procedure with many
variation points is used to learn extraction rules from that knowledge base; the
variation points include heuristics that range from how to select a condition to
how to simplify the resulting rules. We also provide a systematic method to help
re-configure our proposal. Our exhaustive experimentation proves that it beats
others regarding effectiveness and is efficient enough for practical purposes. Our
proposal was devised to be as configurable as possible, which helps adapt it to
particular web sites and evolve it when necessary.
Funding: Ministerio de Educación y Ciencia TIN2007-64119; Junta de Andalucía P07-TIC-2602; Junta de Andalucía P08-TIC-4100; Ministerio de Ciencia e Innovación TIN2008-04718-E; Ministerio de Ciencia e Innovación TIN2010-21744; Ministerio de Economía, Industria y Competitividad TIN2010-09809-E; Ministerio de Ciencia e Innovación TIN2010-10811-E; Ministerio de Ciencia e Innovación TIN2010-09988-E; Ministerio de Economía y Competitividad TIN2011-15497-E; Ministerio de Economía y Competitividad TIN2013-40848-
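As a rough illustration of the idea of an open feature catalogue described in the TANGO abstract, the sketch below maps every element of a small HTML snippet to a dictionary of features. It is not TANGO's implementation: the names `NodeCollector`, `CATALOGUE`, and `build_knowledge_base`, as well as the five features shown, are assumptions made for this example, and the parser assumes well-nested markup without void elements.

```python
from html.parser import HTMLParser


class NodeCollector(HTMLParser):
    """Collects one record per element, assuming well-nested markup."""

    def __init__(self):
        super().__init__()
        self.stack = []   # currently open elements
        self.nodes = []   # every element seen, in document order

    def handle_starttag(self, tag, attrs):
        node = {
            "tag": tag,
            "attrs": dict(attrs),
            "depth": len(self.stack),
            "parent": self.stack[-1] if self.stack else None,
            "text": "",
        }
        self.nodes.append(node)
        self.stack.append(node)

    def handle_endtag(self, tag):
        if self.stack:
            self.stack.pop()

    def handle_data(self, data):
        # Attach text to the innermost open element.
        if self.stack:
            self.stack[-1]["text"] += data.strip()


# Hypothetical catalogue: each entry is a feature name and a function of a node.
# "Open" here means that adding a feature is just adding an entry.
CATALOGUE = {
    "tag": lambda n: n["tag"],
    "depth": lambda n: n["depth"],
    "css_class": lambda n: n["attrs"].get("class"),
    "parent_tag": lambda n: n["parent"]["tag"] if n["parent"] else None,
    "is_numeric": lambda n: n["text"].replace(".", "", 1).isdigit(),
}


def build_knowledge_base(html, catalogue=CATALOGUE):
    """Return one feature dictionary per element in the document."""
    collector = NodeCollector()
    collector.feed(html)
    collector.close()
    return [
        {name: feature(node) for name, feature in catalogue.items()}
        for node in collector.nodes
    ]


kb = build_knowledge_base('<div class="item"><span class="price">9.99</span></div>')
for row in kb:
    print(row)
```

A rule learner would then work on these feature rows rather than on raw markup; extending the catalogue does not require changing `build_knowledge_base`.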
On Extracting Information from Semi-structured Deep Web Documents
Some software agents need information that is provided by
some web sites, which is difficult to obtain if those sites lack a query API. Information
extractors are intended to extract the information of interest automatically and offer it in a structured format. Unfortunately, most of them rely
on ad-hoc techniques, which make them fade away as the Web evolves.
In this paper, we present a proposal that relies on an open catalogue of
features that makes it easy to adapt; we have also devised an optimisation that makes it very efficient. Our experimental results prove
that our proposal outperforms other state-of-the-art proposals.
Funding: Ministerio de Educación y Ciencia TIN2007-64119; Junta de Andalucía P07-TIC-2602; Junta de Andalucía P08-TIC-4100; Ministerio de Ciencia e Innovación TIN2008-04718-E; Ministerio de Ciencia e Innovación TIN2010-21744; Ministerio de Economía, Industria y Competitividad TIN2010-09809-E; Ministerio de Ciencia e Innovación TIN2010-10811-E; Ministerio de Ciencia e Innovación TIN2010-09988-E; Ministerio de Economía y Competitividad TIN2011-15497-E; Ministerio de Economía y Competitividad TIN2013-40848-
On validating web information extraction proposals
Many people who have to make informed decisions in today’s always-on culture use information extractors
to feed their systems with information that comes from human-friendly documents. Unfortunately, many
proposals that validate information extractors have deficiencies that make it difficult to perform homogeneous
comparisons, confirm or refute performance hypotheses, or draw unbiased conclusions. Consequently, it is
very difficult to select the best-performing proposal on a sound basis. The state-of-the-art validation method
overcomes many deficiencies in the previous proposals, but still overlooks the following issues: completeness
of the validation datasets, that is, whether they provide a complete set of annotations or not; structure
of the information, that is, whether they check the structure of the record instances extracted or just the
attribute instances; and, finally, how extractions and annotations are matched. The decisions made regarding
the previous issues have an impact on the effectiveness results. In this article, we have exhaustively analysed
the literature and we have also highlighted the main weaknesses to tackle. We present a guideline and a method
to compute the effectiveness, which complements and enhances the state-of-the-art validation method.
Funding: Ministerio de Economía y Competitividad TIN2016-75394-R; Ministerio de Ciencia e Innovación PID2020-112540RB-C44; Junta de Andalucía P18-RT-1060; Junta de Andalucía US-138137
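The abstract above notes that the way extractions are matched against annotations affects the effectiveness figures. The sketch below illustrates that point only; it is not the method proposed in the article. The functions `precision_recall`, `exact`, and `lenient`, and the sample data, are assumptions made for this example.

```python
def precision_recall(extracted, annotated, match):
    """Compute precision and recall over (attribute, text) pairs.

    Each annotation can be matched by at most one extraction.
    """
    remaining = list(annotated)
    true_positives = 0
    for item in extracted:
        for gold in remaining:
            if match(item, gold):
                remaining.remove(gold)
                true_positives += 1
                break
    precision = true_positives / len(extracted) if extracted else 0.0
    recall = true_positives / len(annotated) if annotated else 0.0
    return precision, recall


def exact(a, b):
    """Strict matching: attribute and text must be identical."""
    return a == b


def lenient(a, b):
    """Looser matching: same attribute, text compared ignoring case and outer spaces."""
    return a[0] == b[0] and a[1].strip().lower() == b[1].strip().lower()


annotated = [("title", "Roller"), ("price", "10 EUR")]
extracted = [("title", "roller "), ("price", "10 EUR"), ("price", "5 EUR")]

strict_scores = precision_recall(extracted, annotated, exact)
lenient_scores = precision_recall(extracted, annotated, lenient)
print(strict_scores, lenient_scores)
```

On this toy data the same extractor scores a recall of 0.5 under strict matching and 1.0 under lenient matching, which is why the matching policy needs to be stated when results are compared.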
Enterprise Information Integration: New Approaches to Web Information Extraction
The way we understand information has changed radically in recent decades thanks to the Web, which drives people to use the Internet at an ever more dizzying pace. It is no surprise, then, that it has become one of the most widely used and universally accessible data distribution channels. However, data alone do not have enough value; they must be turned into information from which useful knowledge can be inferred. This is the purpose of business intelligence, which involves a process of integrating and transforming data into information and then obtaining knowledge in order to support effective decision making. For that integration and transformation process to take place, information extractors are needed: these are the tools that extract data from the Web and endow them with structure and semantics so that they can be interpreted by people or fed into automated business processes and exploited intelligently. In this thesis we focus on learning rules to extract information from semi-structured web documents and on how to evaluate different proposals in order to obtain a ranking in a fully automatic way. Our two information extraction proposals are TANGO and ROLLER; both are based on an open catalogue of features and on inductive techniques. Our proposal for obtaining rankings is called VENICE; it provides an automatic, open, and agnostic method that is based on statistical techniques. We hope that the contributions of this thesis will be useful to both researchers and practitioners and that they will help reduce costs in projects that require extracting information from the Web.
On improving FOIL Algorithm
FOIL is an Inductive Logic Programming algorithm
that discovers first-order rules to explain the patterns involved
in a domain of knowledge. Domains such as Information Retrieval
or Information Extraction are a handicap for FOIL because of the
huge amount of information it needs to manage to devise the rules.
Current solutions to problems in these domains are restricted to
devising ad hoc domain dependent inductive algorithms that use
a less-expressive formalism to code rules.
We work on optimising the FOIL learning process to deal with
such complex domain problems while retaining expressiveness.
Our hypothesis is that changing the information gain scoring
function, used by FOIL to decide how rules are learnt, can reduce
the number of steps the algorithm performs. We have analysed 15
scoring functions, normalised them into a common notation and
run a test in which they are computed. The learning process
is evaluated according to its efficiency, and the quality of
the rules according to their precision, recall, complexity and
specificity. The results reinforce our hypothesis, demonstrating
that replacing the information gain can optimise both the FOIL
algorithm execution and the learnt rules.
Funding: Ministerio de Educación y Ciencia TIN2007-64119; Junta de Andalucía P07-TIC-2602; Junta de Andalucía P08-TIC-4100; Ministerio de Ciencia e Innovación TIN2008-04718-
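To make the hypothesis in the abstract above concrete, the sketch below computes the standard FOIL information gain for a candidate literal and shows that swapping the scoring function can change which literal is selected. The Laplace estimate is used only as a familiar alternative; the abstract does not say which 15 functions were analysed. The helper `pick_literal`, the literal names, and the coverage counts are assumptions made for this example.

```python
import math


def foil_gain(p0, n0, p1, n1, t=None):
    """FOIL information gain for adding one literal to a rule.

    p0, n0: positive/negative bindings covered before adding the literal.
    p1, n1: positive/negative bindings covered after adding it.
    t: positive bindings of the old rule still covered afterwards; assumed
       equal to p1 when omitted (true if no new variables are introduced).
    """
    if p0 == 0 or p1 == 0:
        return 0.0  # simplification of this sketch: no positives, no gain
    if t is None:
        t = p1
    before = math.log2(p0 / (p0 + n0))
    after = math.log2(p1 / (p1 + n1))
    return t * (after - before)


def laplace(p, n):
    """Laplace-corrected precision of a rule covering p positives and n negatives."""
    return (p + 1) / (p + n + 2)


def pick_literal(candidates, p0, n0, score):
    """Return the candidate literal with the highest score.

    candidates maps a literal name to its (p1, n1) coverage counts.
    """
    return max(candidates, key=lambda lit: score(p0, n0, *candidates[lit]))


# Hypothetical literals and coverage counts, for illustration only.
candidates = {"has_class(N, price)": (2, 0), "is_number(N)": (4, 1)}

best_by_gain = pick_literal(candidates, 4, 4, foil_gain)
best_by_laplace = pick_literal(
    candidates, 4, 4, lambda p0, n0, p1, n1: laplace(p1, n1)
)
print(best_by_gain, best_by_laplace)
```

With these counts the information gain prefers the literal that keeps more positive bindings, whereas the Laplace estimate prefers the purer one, so the two scorers steer the search down different paths.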
Optimising FOIL by new scoring functions
FOIL is an Inductive Logic Programming algorithm that discovers first-order rules to explain the patterns involved in a domain of
knowledge. Domains with a huge amount of information are a handicap
for FOIL because of the explosion of the search space when devising the rules.
Current solutions to problems in these domains are restricted to devising
ad hoc domain dependent inductive algorithms that use a less-expressive
formalism to code rules.
We work on optimising the FOIL learning process to deal with such complex
domain problems while retaining expressiveness. Our hypothesis is that
changing the Information Gain scoring function, used by FOIL to decide how rules are learnt, can reduce the number of steps the algorithm
performs. We have analysed 15 scoring functions, normalised them into
a common notation and run a test in which they are computed.
The learning process is evaluated according to its efficiency, and
the quality of the rules according to their precision, recall, complexity
and specificity. The results reinforce our hypothesis, demonstrating that
replacing the Information Gain can optimise both the FOIL algorithm
execution and the learnt rules.
Editorial
Social studies of the world of work agree that, for a decent society to exist, basic conditions for its development must be promoted that give all its members minimum guarantees of being able to lead a dignified life.
ARIEX: Automated ranking of information extractors
Information extractors are used to transform the user-friendly information in a web document into structured
information that can be used to feed a knowledge-based system. Researchers are interested in ranking them
to find out which one performs the best. Unfortunately, many rankings in the literature are deficient. There
are a number of formal methods to rank information extractors, but they also have many problems and have
not reached widespread popularity. In this article, we present ARIEX, which is an automated method to rank
web information extraction proposals. It does not have any of the problems that we have identified in the
literature. Our proposal should help authors make sure that they have advanced the state of the art
not only conceptually, but also from an empirical point of view; it should also help practitioners make informed
decisions on which proposal is the most adequate for a particular problem.
Funding: Ministerio de Educación y Ciencia TIN2007-64119; Junta de Andalucía P07-TIC-2602; Junta de Andalucía P08-TIC-4100; Ministerio de Ciencia e Innovación TIN2008-04718-E; Ministerio de Ciencia e Innovación TIN2010-21744; Ministerio de Economía, Industria y Competitividad TIN2010-09809-E; Ministerio de Ciencia e Innovación TIN2010-10811-E; Ministerio de Ciencia e Innovación TIN2010-09988-E; Ministerio de Economía y Competitividad TIN2011-15497-E; Ministerio de Economía y Competitividad TIN2013-40848-
On Member Labelling in Social Networks
Software agents are increasingly used to search for experts, recommend
resources, assess opinions, and other similar tasks in the context of social networks,
which requires accurate information that describes the features of the members
of the network. Unfortunately, many member profiles are incomplete, which has
motivated many authors to work on automatic member labelling, that is, on techniques
that can infer the null features of a member from his or her neighbourhood. Current
proposals are based on local or global approaches; the former compute predictors from
local neighbourhoods, whereas the latter analyse social networks as a whole. Their
main problem is that they tend to be inefficient and their effectiveness degrades
significantly as the percentage of null labels increases. In this paper, we present Katz,
which is a novel hybrid proposal to solve the member labelling problem using neural
networks. Our experiments prove that it outperforms other proposals in the literature
in terms of both effectiveness and efficiency.
Funding: Ministerio de Educación y Ciencia TIN2007-64119; Junta de Andalucía P07-TIC-2602; Junta de Andalucía P08-TIC-4100; Ministerio de Ciencia e Innovación TIN2008-04718-E; Ministerio de Ciencia e Innovación TIN2010-21744; Ministerio de Economía, Industria y Competitividad TIN2010-09809-E; Ministerio de Ciencia e Innovación TIN2010-10811-E; Ministerio de Ciencia e Innovación TIN2010-09988-E; Ministerio de Economía y Competitividad TIN2011-15497-E; Ministerio de Economía y Competitividad TIN2013-40848-
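The local approaches mentioned in the abstract above can be pictured with a very simple baseline: predict a member's missing label as the most common known label among his or her neighbours. The sketch below shows that baseline only; it is not Katz, which the abstract describes as a hybrid proposal based on neural networks. The function name, the graph, and the labels are assumptions made for this example.

```python
from collections import Counter


def label_from_neighbours(graph, labels, member):
    """Predict a label by majority vote over neighbours with known labels.

    graph maps a member to a list of neighbours; labels maps a member to a
    label, or to None when the label is null. Returns None when no neighbour
    has a known label.
    """
    votes = Counter(
        labels[n] for n in graph.get(member, ()) if labels.get(n) is not None
    )
    if not votes:
        return None
    return votes.most_common(1)[0][0]


# Hypothetical network: "ann" and "zoe" have null labels.
graph = {"ann": ["bob", "eve", "dan"], "zoe": ["dan"]}
labels = {"ann": None, "bob": "ml", "eve": "ml", "dan": None, "zoe": None}

ann_label = label_from_neighbours(graph, labels, "ann")
zoe_label = label_from_neighbours(graph, labels, "zoe")
print(ann_label, zoe_label)
```

The second call returns no prediction because the only neighbour also has a null label, which hints at why purely local predictors lose effectiveness as the share of null labels grows.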